Beta Regression

Team Bee (Beta Regressionists) (Advisor: Dr. Seals)

Anaite Montes Bu, Travis Keep

Introduction

  • Regression analysis is a statistical tool used to explore relationships between variables.

  • Beta regression is the appropriate choice when the dependent variable is a ratio or percentage constrained to lie strictly between 0 and 1.

Why not Linear Regression?

  • It can predict values outside the range of 0 to 1.
  • It assumes constant variance, which is not typical for bounded data.
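The first point can be seen on simulated data. The sketch below is illustrative only: the data, the seed, and all variable names are assumptions and are not taken from either dataset analysed later.

```r
# Simulated proportions whose mean follows a logistic curve in x
# (made-up data, not from the datasets analysed in this talk).
set.seed(1)
x  <- seq(-4, 4, length.out = 200)
mu <- plogis(1.5 * x)                       # true mean, always inside (0, 1)
y  <- rbeta(length(x), mu * 20, (1 - mu) * 20)

# Ordinary least squares ignores the bounds on y.
fit  <- lm(y ~ x)
pred <- predict(fit, newdata = data.frame(x = c(-4, 0, 4)))
print(round(pred, 3))

# Flag predictions that fall outside the valid range for a proportion.
out_of_range <- pred < 0 | pred > 1
print(out_of_range)
```

With this setup the straight line leaves the unit interval at both ends of the x range, even though every simulated observation lies inside it.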

Key Assumptions

  • Beta distribution: Assumes the outcome follows a beta distribution, which is flexible for variables limited to (0, 1).

  • Precision Parameter (\phi): Allows control over the variance of the outcome, enabling flexibility for data with differing levels of dispersion.

Beta distribution

The PDF of a random variable with a beta distribution is as follows.

f(y) = \begin{cases} \frac{y^{\alpha-1}(1-y)^{\beta-1}}{B(\alpha,\beta)}, & 0 \le y \le 1 \\ 0, & \text{elsewhere} \end{cases}

where B(\alpha,\beta) = \int_0^1 y^{\alpha-1}(1-y)^{\beta-1} \ dy = \frac{\Gamma(\alpha) \Gamma(\beta)}{\Gamma(\alpha+\beta)}.

\alpha and \beta are the shape parameters, with \alpha > 0 and \beta > 0. [1]
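As a quick sanity check, the density can be coded directly from the formula and compared with base R's dbeta(). The function name beta_pdf and the shape values 2 and 5 are arbitrary choices for illustration.

```r
# Beta density written term by term from the formula,
# using base R's beta() for B(alpha, beta).
beta_pdf <- function(y, alpha, beta_shape) {
  ifelse(y >= 0 & y <= 1,
         y^(alpha - 1) * (1 - y)^(beta_shape - 1) / beta(alpha, beta_shape),
         0)
}

y       <- c(0.1, 0.25, 0.5, 0.9)
manual  <- beta_pdf(y, alpha = 2, beta_shape = 5)
builtin <- dbeta(y, shape1 = 2, shape2 = 5)
print(all.equal(manual, builtin))

# B(alpha, beta) also equals the gamma-function ratio.
print(all.equal(beta(2, 5), gamma(2) * gamma(5) / gamma(7)))
```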

Beta Distribution Mean and Variance

\begin{align} E[Y] &= \mu = \frac{\alpha}{\alpha+\beta} \\ V[Y] &= \sigma^2 = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)} \end{align} [1]
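The two moment formulas can be checked numerically. This sketch integrates y f(y) and y^2 f(y) over [0, 1] for one arbitrary pair of shape values (2 and 5, chosen only for illustration).

```r
alpha <- 2
beta_shape <- 5

# Moments by numerical integration of y^k * f(y) over [0, 1].
m1 <- integrate(function(y) y   * dbeta(y, alpha, beta_shape), 0, 1)$value
m2 <- integrate(function(y) y^2 * dbeta(y, alpha, beta_shape), 0, 1)$value

# Closed-form mean and variance of the beta distribution.
mean_formula <- alpha / (alpha + beta_shape)
var_formula  <- alpha * beta_shape /
  ((alpha + beta_shape)^2 * (alpha + beta_shape + 1))

print(c(numeric = m1, formula = mean_formula))
print(c(numeric = m2 - m1^2, formula = var_formula))
```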

Introduction of \mu and \phi

For beta regression, it is useful to introduce the following parameters:

\begin{align} \mu &= \frac{\alpha}{\alpha+\beta} \\ \phi &= \alpha + \beta \end{align}

\mu is the mean of the response, and \phi acts as a precision parameter: for a fixed \mu, the higher the \phi, the smaller the variance and the less spread out the PDF. [2]
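Solving the two definitions for the shape parameters gives \alpha = \mu\phi and \beta = (1 - \mu)\phi. The helper names below (to_mean_precision, to_shapes) are hypothetical and exist only for this sketch, which converts in both directions and then simulates draws to show the spread shrinking as \phi grows.

```r
# Convert between (alpha, beta) and (mu, phi).
to_mean_precision <- function(alpha, beta_shape) {
  list(mu = alpha / (alpha + beta_shape), phi = alpha + beta_shape)
}
to_shapes <- function(mu, phi) {
  list(alpha = mu * phi, beta_shape = (1 - mu) * phi)
}

# Round trip: shapes -> (mu, phi) -> shapes.
mp <- to_mean_precision(alpha = 2, beta_shape = 5)
sh <- to_shapes(mp$mu, mp$phi)
print(unlist(mp))
print(unlist(sh))

# Same mean (0.3), increasing precision: simulated standard deviations.
set.seed(42)
phis <- c(5, 50, 500)
sds <- sapply(phis, function(phi) {
  s <- to_shapes(0.3, phi)
  sd(rbeta(1e5, s$alpha, s$beta_shape))
})
print(data.frame(phi = phis, sd = round(sds, 4)))
```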

Revised Beta Distribution

f(y; \mu, \phi) = \frac{\Gamma(\phi)}{\Gamma(\mu \phi) \Gamma((1 - \mu)\phi)} y^{\mu\phi - 1}(1 - y)^{(1 - \mu)\phi - 1}, \quad 0 < y < 1

where:

  • μ is the mean, ϕ is the precision parameter (for a fixed mean, a larger ϕ gives a smaller variance), and Γ is the gamma function.
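A minimal sketch of this density in R, assuming nothing beyond base functions. The name dbeta_mu_phi is made up for illustration; working on the log scale with lgamma() is a design choice to avoid overflow when ϕ is large.

```r
# Density in the (mu, phi) parameterisation, computed on the log scale.
dbeta_mu_phi <- function(y, mu, phi) {
  log_f <- lgamma(phi) - lgamma(mu * phi) - lgamma((1 - mu) * phi) +
    (mu * phi - 1) * log(y) + ((1 - mu) * phi - 1) * log(1 - y)
  exp(log_f)
}

# It should agree with dbeta() using shape1 = mu * phi, shape2 = (1 - mu) * phi.
y <- c(0.05, 0.3, 0.6, 0.95)
check <- all.equal(dbeta_mu_phi(y, mu = 0.3, phi = 12),
                   dbeta(y, shape1 = 0.3 * 12, shape2 = 0.7 * 12))
print(check)
```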

Beta Distribution Variance

\text{Var}(Y) = \frac{\mu(1 - \mu)}{1 + \phi}

  • For a fixed \phi, the variance is largest at \mu = 0.5 and shrinks toward 0 as \mu approaches either extreme, 0 or 1. [3]

  • Higher values of ϕ correspond to lower variance, indicating that observations are more precise around the mean in beta regression.
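The variance formula follows from the original beta variance by substituting \alpha = \mu\phi and \beta = (1 - \mu)\phi, so that \alpha + \beta = \phi. Written out as a short derivation:

```latex
\begin{align}
\text{Var}(Y) &= \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)} \\
              &= \frac{\mu\phi \, (1-\mu)\phi}{\phi^2(\phi+1)} \\
              &= \frac{\mu(1-\mu)}{1+\phi}
\end{align}
```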

Extended Beta Regression

Bias Correction/Reduction - Type of Estimator:

  • ML (Maximum Likelihood): Standard method; its estimates can be biased, particularly in small samples. [4]

  • BC (Bias-Corrected): Adjusts the ML estimates by an estimate of their bias, providing more reliable parameter values.

  • BR (Bias-Reduced): Solves adjusted score equations so that bias is reduced during estimation rather than corrected afterwards.
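In the betareg package the estimator is selected with the type argument. The sketch below is one possible way to compare the three fits side by side; it uses the ReadingSkills data bundled with betareg purely as a convenient example and is skipped if the package is not installed.

```r
if (requireNamespace("betareg", quietly = TRUE)) {
  data("ReadingSkills", package = "betareg", envir = environment())

  # Mean submodel before "|", precision submodel after it.
  f <- accuracy ~ dyslexia * iq | dyslexia + iq

  # Fit the same model with each estimator type.
  fits <- lapply(c(ML = "ML", BC = "BC", BR = "BR"), function(type) {
    betareg::betareg(f, data = ReadingSkills, type = type)
  })

  # Rows are parameters, columns are estimator types.
  print(round(sapply(fits, coef), 3))
}
```

Differences between the columns are expected to be most visible in the precision coefficients, since the sample is small.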

Extended Beta Regression

Beta Regression Trees

  • This extension uses recursive partitioning to model data that might exhibit subgroup-specific relationships.

  • It builds decision trees by splitting data into different subgroups based on the instability of model parameters across partitioning variables.
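A minimal sketch of a beta regression tree, assuming the betatree() interface in betareg (which relies on the partykit package). It uses the bundled ReadingSkills data with dyslexia as the only partitioning variable; the minsize value is an arbitrary choice for illustration, and the block is skipped if either package is missing.

```r
if (requireNamespace("betareg", quietly = TRUE) &&
    requireNamespace("partykit", quietly = TRUE)) {
  data("ReadingSkills", package = "betareg", envir = environment())

  # Mean and precision both depend on iq; dyslexia is offered as the
  # partitioning variable that may split the sample into subgroups.
  rs_tree <- betareg::betatree(accuracy ~ iq | iq, ~ dyslexia,
                               data = ReadingSkills, minsize = 10)
  print(rs_tree)
}
```

Each terminal node of the printed tree carries its own set of beta regression coefficients.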

Methods - Suicide Rates Dataset

Model Approach:

  • Beta regression was used to model suicide rates as a function of socio-economic factors, since rates are bounded between 0 and 1.

  • Dataset: Suicide Rates Overview 1985 to 2016, with variables like HDI, GDP per capita, sex, age group, and generation.

  • Cleaned data by removing outliers using Cook’s distance and leverage analysis.

  • Managed missing values and calculated descriptive statistics.
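The outlier screening step could look roughly like the sketch below. It is not the code used for the suicide data: it runs on the ReadingSkills data bundled with betareg, and the 4/n cutoff for Cook's distance is a common rule of thumb assumed here, not a value taken from the analysis.

```r
if (requireNamespace("betareg", quietly = TRUE)) {
  data("ReadingSkills", package = "betareg", envir = environment())
  fit <- betareg::betareg(accuracy ~ dyslexia * iq, data = ReadingSkills)

  # Influence measures provided for betareg fits.
  cd  <- cooks.distance(fit)
  lev <- hatvalues(fit)

  # Assumed rule-of-thumb cutoff for flagging influential observations.
  cutoff  <- 4 / nrow(ReadingSkills)
  flagged <- which(cd > cutoff)
  print(flagged)

  # Refit without the flagged rows.
  keep    <- setdiff(seq_len(nrow(ReadingSkills)), flagged)
  cleaned <- ReadingSkills[keep, ]
  refit   <- betareg::betareg(accuracy ~ dyslexia * iq, data = cleaned)
}
```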

Methods Cont’d

Model Details:

  • Incorporated interaction terms and adjusted precision (phi) to account for variance differences across groups.

  • Used beta regression trees to capture nonlinear relationships.

Evaluation:

  • Model performance assessed via pseudo R-squared.

  • Software: R for data management and analysis.
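For reference, the pseudo R-squared reported by betareg is, to our understanding, the squared correlation between the fitted linear predictor and the link-transformed response. The sketch below reproduces it by hand on the bundled ReadingSkills data (an illustrative stand-in, since the suicide data is not bundled) and is skipped if betareg is not installed.

```r
if (requireNamespace("betareg", quietly = TRUE)) {
  data("ReadingSkills", package = "betareg", envir = environment())
  fit <- betareg::betareg(accuracy ~ dyslexia * iq, data = ReadingSkills)

  # Value stored on the fitted model object.
  r2_pkg <- fit$pseudo.r.squared

  # Same quantity by hand: squared correlation between the linear
  # predictor and the logit-transformed response.
  eta       <- predict(fit, type = "link")
  r2_manual <- cor(eta, qlogis(ReadingSkills$accuracy))^2

  print(c(package = r2_pkg, manual = r2_manual))
}
```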

Methods - Reading Skills Dataset

Model Rationale:

  • ReadingSkills dataset (N=44): Examines reading scores (0.0–1.0) for 44 children, including 19 with dyslexia and 25 without.

  • Beta regression models the response variable within the (0, 1) range, which suits the bounded reading scores better than ordinary linear regression.

Methods Cont’d

Data transformation:

  • The response variable is kept strictly inside (0, 1). Its mean is related to the predictors through a logit link, and the precision parameter (ϕ) through a log link, so ϕ may vary with predictors such as IQ and dyslexia status.
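The role of the two link functions can be illustrated in base R. All coefficient values below are hypothetical and chosen only to show the mapping; they are not fitted estimates.

```r
# Hypothetical coefficients for illustration only (not fitted values).
beta_mean <- c(intercept = 1.0, iq = 0.5)   # mean submodel, logit scale
gamma_phi <- c(intercept = 3.0, iq = 1.0)   # precision submodel, log scale

iq <- c(-1.5, 0, 1.5)
eta_mu  <- beta_mean[["intercept"]] + beta_mean[["iq"]] * iq
eta_phi <- gamma_phi[["intercept"]] + gamma_phi[["iq"]] * iq

mu  <- plogis(eta_mu)   # inverse logit: any real number -> (0, 1)
phi <- exp(eta_phi)     # inverse log:   any real number -> (0, Inf)

print(data.frame(iq, mu = round(mu, 3), phi = round(phi, 1)))
```

Whatever values the linear predictors take, the implied mean stays inside (0, 1) and the implied precision stays positive.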

Analysis - Suicide Rates Dataset

Extended Beta Regression

Beta Regression Base Model:

library(betareg)

# Base model: mean submodel only, constant precision
betareg(
  formula = suicide_rate ~ HDI_year + GDP_capita + sex + age + generation,
  data = suicide_dataset
)

Bias Corrected (BC) Model:

# Terms after "|" model the precision (phi); type = "BC" requests bias correction
betareg(
  formula = suicide_rate ~ HDI_year + GDP_capita + sex + age + generation | HDI_year + GDP_capita,
  data = suicide_dataset,
  type = "BC"
)

Analysis Cont’d

Extended Beta Regression

Beta Regression Trees:

The beta regression tree shows HDI_year as a key predictor, with specific thresholds creating groupings where higher HDI_year values link to better outcomes and smaller nodes show more variability.

Analysis Cont’d - Beta Regression Tree

Analysis Cont’d

Model Diagnostics

The package betareg allows users to perform both fixed and variable dispersion beta regression [5].

Analysis Cont’d

Model Diagnostics

The cleaned model shows improved fit, with more random residuals, fewer outliers, and reduced data point influence.

Analysis Cont’d

Analysis - Reading Skills Dataset

Regressors

  • IQ (Z-score)
    • Min -1.745
    • Median -0.122
    • Max 1.856
  • Dyslexia
    • Yes
    • No

Analysis Cont’d

Dataset Tweaking

  • Dyslexia
    • No -> 0.0
    • Yes -> 1.0
  • Reading Score
    • 1.0 -> 0.99

Remember: the dependent variable must lie in the open interval (0, 1), so a score of exactly 1.0 is replaced by 0.99.
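The recoding can be sketched as follows. The raw data frame here is a made-up stand-in with four rows, and the column names score and dyslexia are assumptions; only the output names accuracy, dcode, and iq match the model calls in this deck.

```r
# Toy stand-in for the raw data (values are made up for illustration).
raw <- data.frame(
  dyslexia = factor(c("no", "yes", "no", "yes")),
  score    = c(1.0, 0.62, 0.88, 0.47),
  iq       = c(0.8, -0.3, 1.2, -1.1)
)

ReadingSkillsModel <- data.frame(
  accuracy = pmin(raw$score, 0.99),             # 1.0 -> 0.99, others unchanged
  dcode    = as.numeric(raw$dyslexia == "yes"), # No -> 0.0, Yes -> 1.0
  iq       = raw$iq
)
print(ReadingSkillsModel)
```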

Analysis Cont’d

Beta Regression Fitting: Bias Corrected (BC)

# dcode * iq models the mean; dcode + iq (after "|") models the precision
betareg(
  formula = accuracy ~ dcode * iq | dcode + iq,
  data = ReadingSkillsModel,
  type = "BC"
)

Generalized Linear Model (Gaussian family, logit link)

glm(
  formula = accuracy ~ dcode * iq,
  family = gaussian(link = "logit"),
  data = ReadingSkillsModel
)

logit maps (0, 1) to \mathbb{R}

Analysis Cont’d

Results for Children without Dyslexia

Analysis Cont’d

Results for Dyslexic Children

Results - Suicide Dataset

Table 2. Impact of Socioeconomic Factors on Suicide Rates: Base and Bias-Corrected Models (Part 1)

                       | Beta Regression Base Model      | Bias-Corrected Beta Regression
Variable               | Beta (SE)           | p-value   | Beta (SE)           | p-value
Intercept              | -5.75 (0.121)       | < 2e-16   | -5.49 (1.53)        | < 2e-16
HDI_year               | 3.60 (0.160)        | < 2e-16   | 3.31 (2.03)         | < 2e-16
GDP_capita             | -6.4e-06 (6.44e-07) | < 2e-16   | -8.7e-06 (7.3e-07)  | < 2e-16
Sex (Male)             | 0.81 (0.019)        | < 2e-16   | 0.82 (0.018)        | < 2e-16
Age 25-34 years        | 0.084 (0.033)       | 0.0099    | 0.090 (0.032)       | 0.0053
Age 35-54 years        | 0.070 (0.040)       | 0.080     | 0.086 (0.040)       | 0.030
Age 5-14 years         | -0.96 (0.046)       | < 2e-16   | -0.96 (0.046)       | < 2e-16
Age 55-74 years        | -0.16 (0.053)       | 0.0030    | -0.13 (0.053)       | 0.011
Age 75+ years          | -0.21 (0.060)       | 0.00047   | -0.18 (0.060)       | 0.0024
G.I. Generation        | 0.51 (0.048)        | < 2e-16   | 0.49 (0.048)        | < 2e-16
Generation X           | -0.22 (0.037)       | 4.66e-09  | -0.21 (0.037)       | 9.60e-09
Generation Z           | -0.49 (0.068)       | 3.03e-13  | -0.56 (0.067)       | < 2e-16
Generation Millennials | -0.42 (0.046)       | < 2e-16   | -0.43 (0.046)       | < 2e-16

Results Cont’d

                       | Beta Regression Base Model      | Bias-Corrected Beta Regression
Variable               | Beta (SE)           | p-value   | Beta (SE)           | p-value
Generation Silent      | 0.098 (0.037)       | 0.0086    | 0.091 (0.037)       | 0.013
Precision Model        |                     |           |                     |
HDI_year               |                     |           | 0.64 (0.289)        | 0.026
GDP_capita             |                     |           | 8.31e-06 (1.15e-06) | 4.75e-13
Model Fit              |                     |           |                     |
Pseudo R-squared       | 0.4623              |           | 0.4625              |

Results - Reading Skills Dataset

Table 2: Association of Reading Skills Score with IQ and presence of Dyslexia

             | Beta Regression              | Generalized Linear Model
Variable     | β      | SE     | p          | β       | SE     | p
Dyslexia     | -1.446 | 0.2954 | 9.767e-07  | -1.598  | 0.2448 | 8.565e-08
IQ (Z-score) | 1.049  | 0.2718 | 0.0001132  | 0.4851  | 0.2916 | 0.104
Dyslexia:IQ  | -1.144 | 0.2768 | 3.593e-05  | -0.5463 | 0.3145 | 0.09001


Results Cont’d

Dyslexia’s effect on scores

A child’s odds of answering a reading skills question correctly decrease by a factor of e^{1.446} \approx 4.25 if they are dyslexic, assuming an average IQ (z-score of 0).

Results Cont’d

IQ’s effect on scores

  • If a non-dyslexic child’s IQ increases by 1 standard deviation, their odds of answering a reading skills question correctly increase by a factor of e^{1.049} \approx 2.85.

  • If a dyslexic child’s IQ increases by 1 standard deviation, their odds of answering a reading skills question correctly decrease by a factor of e^{0.095} \approx 1.10.

The slope for dyslexic children is the IQ coefficient plus the interaction coefficient: 1.049 - 1.144 = -0.095.
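The odds factors quoted in these results come from exponentiating the beta regression coefficients in the table. A short base R sketch of that arithmetic, with variable names chosen only for this example:

```r
# Mean-model coefficients from the beta regression column of the table.
b_dyslexia    <- -1.446
b_iq          <-  1.049
b_interaction <- -1.144

# Dyslexia effect at IQ z-score 0: odds are divided by this factor.
factor_dyslexia <- exp(-b_dyslexia)

# IQ effect per standard deviation for non-dyslexic children.
factor_iq_nondyslexic <- exp(b_iq)

# IQ slope for dyslexic children combines main effect and interaction.
slope_dyslexic     <- b_iq + b_interaction
factor_iq_dyslexic <- exp(-slope_dyslexic)

print(round(c(dyslexia       = factor_dyslexia,
              iq_nondyslexic = factor_iq_nondyslexic,
              iq_dyslexic    = factor_iq_dyslexic), 2))
```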

Conclusion

  • Effective for proportion data: ideal for modeling responses bounded in the (0, 1) range.
  • Models both mean and precision, managing boundary cases and latent heterogeneity.
  • Bias correction and beta regression trees expand its capabilities.
  • The betareg package in R offers a powerful, flexible framework for analysts.

References

[1]
D. D. Wackerly, Mathematical statistics with applications, 6th ed. Duxbury Press, 2002.
[2]
S. Ferrari and F. Cribari-Neto, “Beta regression for modelling rates and proportions,” J. Appl. Stat., vol. 31, no. 7, pp. 799–815, Aug. 2004, doi: 10.1080/0266476042000214501.
[3]
S. Ferrari and F. Cribari-Neto, “Beta regression for modelling rates and proportions,” J. Appl. Stat., vol. 31, no. 7, pp. 799–815, Aug. 2004, doi: 10.1080/0266476042000214501.
[4]
B. Grün, I. Kosmidis, and A. Zeileis, “Extended beta regression in R: Shaken, stirred, mixed, and partitioned,” J. Stat. Softw., vol. 48, no. 11, 2012.
[5]
A. Zeileis, F. Cribari-Neto, B. Grün, and I. Kosmidis, “Betareg: Beta regression.” The R Foundation, Apr. 2004.